ABSTRACT

Students at grades four and five were administered a 
writing assessment that was developed to correspond to the California 
Learning Assessment System (CLAS) writing tasks at grade four. 
Teachers were trained to score the CLAS-like tasks according to the 
rubric developed by the State for CLAS. In addition, 164 students at 
three schools in the Riverside Unified School District, California, 
took part in the CLAS student level pilot in Spring, 1993. A 
generalizability study was conducted using a one-facet, crossed
design. The outcome was then used to conduct a decision study to 
determine how many tasks would be required to achieve various levels 
of student-level reliability. Comparisons of school level performance 
summaries were made between results from the CLAS assessment at grade 
four and results of the district's CLAS-like assessment. Results 
indicated that performance varied considerably both in terms of 
percents on the score categories as well as in mean scores. Local 
scorers on the CLAS-like tasks tended to place substantially more 
students at both extremes of the rubric than did the scorers of CLAS. 
In many cases, the rank ordering was markedly different. Results also
indicated that at least five separate tasks scored and averaged would 
be required to achieve adequate levels of reliability. (Contains 
three tables of data.) (Author/RS) 
Introduction

The Riverside Unified School District has been administering a direct 
writing program at selected grade levels since 1991. Three writing 
prompts per grade level from Psychological Corporation's Language Arts 
Performance Assessments (LAPA) have comprised the core of the program. 
In addition to these assessments, students at grades four and five are 
administered a writing assessment that was developed to correspond to 
the CLAS writing tasks at grade four. Teachers were trained to score these 
CLAS-like tasks according to the rubric developed by the State for CLAS. 
Scores from this assessment are used in the district to certify whether or 
not students have met AB65 elementary competency standards. 

Administration of these tasks at grade four in conjunction with the 
administration of the CLAS language arts assessment in spring of 1993 
afforded the opportunity to compare performance on the CLAS with 
performance on tasks that were developed to coincide closely with the 
CLAS program. This is particularly relevant since one of the goals of SB662 
was to have school districts implement CLAS compatible tasks at other 
grade levels. The extent to which performance on these tasks is aligned 
with performance on the CLAS itself will in part determine the 
effectiveness of a school district's efforts to assess students at various 
grade levels in comparable ways. 

In addition, 164 students at three of our schools took part in the CLAS
student level pilot in Spring, 1993. For these students, individual scores on 
the CLAS and the district's CLAS-like writing assessment are available. In 
order to determine the dependability of scores on the two tasks, a 
Generalizability study (G-study) was conducted using a one facet, crossed 
design. The outcome of this G-study was then used to conduct a Decision 
study (D-study) to determine how many tasks would be required to 
achieve various levels of student level reliability. 
Method (School level comparisons)



Comparisons of school level performance summaries were made between 
results from the CLAS assessment at grade four and results of the district's 
CLAS-like assessment. These results are contained in Tables 1 and 2. It is
clear from these data that performance varied considerably, both in the
percentages in each score category and in mean scores. Local scorers on
the CLAS-like tasks tended to place substantially more students at both 
extremes of the rubric than did the scorers of CLAS. 

Table 2 shows the rank ordering of schools based on mean scores for both 
CLAS and the CLAS-like assessments. In many cases the rank ordering 
was markedly different. The correlation between scores for the twenty-six
elementary schools was .16. 
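
The school-level correlation can be checked against the mean scores listed in Table 2. The sketch below assumes the reported value is a Pearson correlation computed on the 26 pairs of school means; the paper does not say which coefficient was used, and the names `SCHOOL_MEANS` and `pearson_r` are illustrative only.

```python
from math import sqrt

# (CLAS mean, CLAS-like mean) for schools A through Z, transcribed from Table 2.
SCHOOL_MEANS = [
    (3.27, 3.54), (3.55, 3.35), (2.92, 3.21), (3.28, 4.48), (3.56, 3.20),
    (3.51, 4.23), (3.28, 3.47), (3.35, 3.31), (3.62, 3.58), (3.09, 3.34),
    (3.14, 3.24), (3.48, 3.29), (3.12, 3.50), (3.33, 2.87), (3.88, 3.49),
    (3.28, 3.10), (3.10, 2.94), (3.49, 3.75), (3.39, 3.67), (3.47, 3.20),
    (3.52, 3.35), (3.73, 3.79), (3.62, 3.23), (3.59, 3.27), (3.21, 3.63),
    (3.15, 3.61),
]


def pearson_r(pairs):
    """Pearson product-moment correlation for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


r = pearson_r(SCHOOL_MEANS)
print(f"correlation across {len(SCHOOL_MEANS)} schools: {r:.3f}")
```

Under that assumption the result should land close to the reported .16, which indicates how little the two assessments agree on the relative standing of schools.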

Discussion 

It should be noted that a number of factors could have contributed to the 
disparities in scores on the CLAS compared to the CLAS-like assessment at 
fourth grade. First, while the CLAS-like prompts were developed to mirror 
the CLAS as much as possible, the CLAS-like prompts were not integrated 
with reading/group discussion as are the CLAS writing tasks. Second, 
while an effort was made to train teachers to score the CLAS-like writing 
on a six-point rubric mirroring the CLAS scoring procedures, this training 
was not as extensive as that afforded the teachers scoring the CLAS writing 
(two hours vs. four-six hours). Also, for the CLAS-like assessment in RUSD,
teachers scored their own students' papers while focusing primarily on
Rhetorical Effectiveness. The Writing Conventions element was considered
only when a reader was unsure about a score. CLAS papers, on the other
hand, are given separate scores for Rhetorical Effectiveness and 
Conventions which are then weighted to arrive at a composite score. 

Method (Generalizability) 

A one-facet generalizability study (G-study) was conducted to establish the
dependability of the measurements used at fourth grade assuming the 
CLAS and CLAS-like tasks to be interchangeable. A randomized block 
design (Kirk, 1968) was used to determine the variance components of the 
model (Table 3). Because the variance component for items (writing tasks) 
was negative, it was set to zero in accordance with standard 
generalizability procedures.
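
The variance components in Table 3 can be recovered from the mean squares with the usual expected-mean-square equations for a persons x items crossed design. The following is a minimal sketch, not the authors' computation: the function name is hypothetical, and `n_persons=165` is inferred from the 164 degrees of freedom in Table 3 (the text reports 164 students; either value gives an item component of about -.0035).

```python
def variance_components(ms_persons, ms_items, ms_residual, n_persons, n_items):
    """Estimate variance components for a one-facet crossed (p x i) design.

    Negative estimates are set to zero, as described in the text; the raw
    item estimate is kept so the adjustment stays visible.
    """
    var_persons = (ms_persons - ms_residual) / n_items
    var_items_raw = (ms_items - ms_residual) / n_persons
    return {
        "persons": max(var_persons, 0.0),
        "items": max(var_items_raw, 0.0),
        "items_raw": var_items_raw,
        "residual": ms_residual,  # PI interaction confounded with error
    }


# Mean squares taken from Table 3; two tasks (CLAS and CLAS-like) per student.
components = variance_components(
    ms_persons=1.89, ms_items=0.048, ms_residual=0.634,
    n_persons=165, n_items=2,
)
for source, value in components.items():
    print(f"{source:10s} {value: .4f}")
```

With these inputs the person component comes out at .628 and the raw item component is slightly negative, matching the footnote to Table 3.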

Table 4 provides the variance component estimates and the
generalizability coefficients corresponding to student assessments
consisting of various numbers of tasks like the ones administered in the
CLAS pilot and the district's CLAS-like program. Although G-coefficients
are computed differently for relative vs. absolute decisions, the results are 
identical in this case because of the zero variance component for "Items". 
It can be seen that with one task, the G-coefficient is .50. Using a method 
analogous to the Spearman-Brown formula in classical test theory, this 
coefficient can be projected to various numbers of tasks. In order to 
achieve a G-coefficient of .74, three tasks would be required while five 
tasks would be needed to improve the dependability to .83. 
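
The projection to different numbers of tasks can be sketched as follows. This is an illustration under the assumption that the error variance is the PI component divided by the number of tasks (with the item component added for absolute decisions); the function names are hypothetical and the inputs are the Table 4 variance components.

```python
def g_coefficient(var_persons, var_residual, n_tasks, var_items=0.0, absolute=False):
    """G-coefficient for a score averaged over n_tasks tasks."""
    error = var_residual / n_tasks
    if absolute:
        # Absolute decisions also count task difficulty as error.
        error += var_items / n_tasks
    return var_persons / (var_persons + error)


def min_tasks(target, var_persons, var_residual, max_tasks=50):
    """Smallest number of tasks whose G-coefficient reaches the target."""
    for n in range(1, max_tasks + 1):
        if g_coefficient(var_persons, var_residual, n) >= target:
            return n
    return None


VAR_PERSONS, VAR_PI = 0.628, 0.634  # from Table 4
projected = {n: g_coefficient(VAR_PERSONS, VAR_PI, n) for n in range(1, 6)}
for n, g in projected.items():
    print(f"{n} task(s): G = {g:.3f}")
print("tasks needed for G >= .80:", min_tasks(0.80, VAR_PERSONS, VAR_PI))
```

The computed values agree with Table 4 to within about .01; the small differences at three and four tasks suggest the table entries were truncated or projected from a rounded single-task coefficient.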

Discussion 

The pattern of these results is consistent with that found in other
generalizability studies of direct writing, although the G-coefficient of .50
was at the high end of those typically reported for single task writing
samples. Since the reliability coefficient (analogous to the G-coefficient)
places an upper limit on the validity coefficient, it is important to know
that the maximum validity obtainable if our writing tasks could be
correlated to a hypothetical "true writing score" is .71, the square root of
the G-coefficient. The coefficient of determination is thus .50, indicating
that a maximum of 50% of the variance in our writing scores can be
attributed to the underlying achievement trait when only one writing task
is given to students.
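
The ceiling on validity follows from the classical result that a score cannot correlate with any criterion more highly than the square root of its reliability. A small sketch, treating the G-coefficient as the reliability (the function name is illustrative):

```python
from math import sqrt


def max_validity(reliability):
    """Upper bound on the validity coefficient implied by a reliability value."""
    return sqrt(reliability)


single_task = max_validity(0.50)  # one writing task
five_tasks = max_validity(0.83)   # five tasks, per Table 4
print(f"one task:   ceiling {single_task:.2f}, shared variance {single_task ** 2:.2f}")
print(f"five tasks: ceiling {five_tasks:.2f}, shared variance {five_tasks ** 2:.2f}")
```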

Summary 

To those who have studied the technical characteristics of writing 
assessment programs, these results should not be surprising. It is 
important, however, to keep reminding users of direct writing information 
about the dependability of information obtained from a single sample of
student work. High stakes decisions should not be based on single
assessments with this degree of reliability. Because of the simplicity of
using a single score, school districts often rely on one writing sample to
make judgments about students, including the certification of AB65
competency. Unfortunately, there continues to be little psychometric
justification for doing so. 

Given the reality of what it takes to achieve adequate levels of reliability
in a standardized performance assessment program (probably at least five
separate tasks that are scored and averaged), it is not likely that many
districts will make the commitment to such an extensive, and intrusive,
assessment program, especially when writing is only one of several
content areas that need to be assessed. In all likelihood, performance
assessment will remain dichotomized between (a) single performance
assessments in a couple of content areas for purposes of standardized
accountability, and (b) a system of informal performance assessments such
as teacher working portfolios for student level diagnostic feedback.

One of the ways that informal, classroom-based performance assessments
might be used is to certify graduation proficiencies. Although this may at
first glance appear to be more subjective and less defensible than a formal
writing sample, given the low levels of reliability extant in single sample
assessments, we may be on firmer ground relying on the expert judgment
of professionals utilizing a variety of diverse, informal assessments.






Table 1

Percents in Performance Levels
CLAS vs. CLAS-Like

                       Percents in CLAS Performance Levels
SCHOOL                  L1    L2    L3    L4    L5

A         CLAS           2    13    42    33     5
          CLAS-LIKE      7    11    33    28    14
B         CLAS           0     5    42    39     9
          CLAS-LIKE      4    24    25    30    14
C         CLAS           2    26    56    12     5
          CLAS-LIKE      0    27    39    20    14
D         CLAS           3    15    39    39     5
          CLAS-LIKE      2     8    13    22    27
E         CLAS           0     8    39    44     7
          CLAS-LIKE     11    16    27    32    13
F         CLAS           0    11    38    40    11
          CLAS-LIKE      0     2    26    34    23
G         CLAS           0    20    36    37     5
          CLAS-LIKE      9    19    21    22    20
H         CLAS           0     7    50    37     2
          CLAS-LIKE      0    22    38    29     9
I         CLAS           0    11    31    40    16
          CLAS-LIKE      2    19    33    21    15
J         CLAS           0    13    64    22     0
          CLAS-LIKE      4    22    29    27    17
K         CLAS           2    20    41    32     3
          CLAS-LIKE      7    16    43    18    12
L         CLAS           0     5    48    39     7
          CLAS-LIKE      5    21    34    25    10
M         CLAS           7    14    40    29     5
          CLAS-LIKE      2    18    27    33    16
N         CLAS           7     9    35    32    11
          CLAS-LIKE      9    29    37    18     8
O         CLAS           2     2    22    51    20
          CLAS-LIKE      6    20    25    24    18
P         CLAS           0    11    49    38     0
          CLAS-LIKE     11    17    38    21    14
Q         CLAS           3    16    55    22     5
          CLAS-LIKE      7    36    26    21    10
R         CLAS           0     2    49    47     2
          CLAS-LIKE      4    11    25    33    20
S         CLAS           3     6    47    39     6
          CLAS-LIKE      5    12    24    36    16
T         CLAS           0     7    49    36     9
          CLAS-LIKE     12    22    22    26    17
U         CLAS           4     7    34    42     9
          CLAS-LIKE      3    15    40    27    13
V         CLAS           0     7    26    55     9
          CLAS-LIKE      2     6    21    59     6
W         CLAS           3     0    46    30    18
          CLAS-LIKE      9    19    30    22    14

SCHOOL                  L1    L2    L3    L4    L5    L6

X         CLAS           0    15    31    36    16     2
          CLAS-LIKE      8    22    29    21    13     6
Y         CLAS           5    12    49    27     5     2
          CLAS-LIKE      7    11    25    31    21     5
Z         CLAS           2    15    50    30     2     0
          CLAS-LIKE      4    17    20    32    25     1
DISTRICT  CLAS           2    11    43    36     7     0
          CLAS-LIKE      6    18    29    27    15     5
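
The mean scores in Table 2 appear to be percent-weighted averages of the level numbers in Table 1. This is an inference from the figures rather than something the paper states, and the function name below is hypothetical. A short sketch using three of the rows for which all six levels are listed:

```python
def mean_level(percents):
    """Percent-weighted mean performance level; levels are numbered from 1."""
    total = sum(percents)
    weighted = sum(level * pct for level, pct in enumerate(percents, start=1))
    return weighted / total  # dividing by the total absorbs rounding in the percents


# Rows from Table 1 (L1 through L6).
school_x_clas = [0, 15, 31, 36, 16, 2]
school_y_clas_like = [7, 11, 25, 31, 21, 5]
school_z_clas = [2, 15, 50, 30, 2, 0]

for label, row in [("X, CLAS", school_x_clas),
                   ("Y, CLAS-like", school_y_clas_like),
                   ("Z, CLAS", school_z_clas)]:
    print(f"School {label}: mean level {mean_level(row):.2f}")
```

For these rows the computed means match the Table 2 entries (3.59, 3.63, and 3.15).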
Table 2

Mean Scores and Rank Ordering on CLAS and CLAS-Like Writing
Fourth Grade

             Mean Score            Rank
School     CLAS    CLAS-LK     CLAS    CLAS-LK

A          3.27    3.54        19       9
B          3.55    3.35         7      13.5
C          2.92    3.21        26      21
D          3.28    4.48        17       1
E          3.56    3.20         6      22.5
F          3.51    4.23         9       2
G          3.28    3.47        17      12
H          3.35    3.31        14      16
I          3.62    3.58         3.5     8
J          3.09    3.34        25      15
K          3.14    3.24        22      19
L          3.48    3.29        11      17
M          3.12    3.50        23      10
N          3.33    2.87        15      26
O          3.88    3.49         1      11
P          3.28    3.10        17      24
Q          3.10    2.94        24      25
R          3.49    3.75        10       4
S          3.39    3.67        13       5
T          3.47    3.20        12      22.5
U          3.52    3.35         8      13.5
V          3.73    3.79         2       3
W          3.62    3.23         3.5    20
X          3.59    3.27         5      18
Y          3.21    3.63        20       6
Z          3.15    3.61        21       7
Table 3

Analysis of Variance Table and Estimated Variance Components

Source of variation    Sum of Squares     df    Mean Square    Variance component

Main Effects
  Items (I)                  .048           1        .048            0*
  Persons (P)             309.891         164       1.89             .628
2-way interaction
  PI,e                    103.952         164        .634            .634
Total                     413.891         329

* Negative value (-.0035) set equal to zero.








Table 4

Estimated Variance Components and Generalizability Coefficients for Different
Decision Study Designs

                                 Number of Items
Source of Variation           1      2      3      4      5

Persons (P)                 .628   .628   .628   .628   .628
Items (I)                    0*     0*     0*     0*     0*
PI                          .634   .317   .211   .159   .127

Generalizability Coefficients
  Relative Decisions        .50    .66    .74    .79    .83
  Absolute Decisions**      .50    .66    .74    .79    .83

* Negative value (-.0035) set equal to zero.
** Absolute decisions yield the same value as Relative because of zero item variance.